Kernel probability density estimates can be used to construct a test of
the hypothesis that the density underlying a given univariate data set has at
most k modes, for any given k ≥ 1. The test is based on the critical value
of the smoothing parameter for k modes to occur in the estimate. The
theoretical properties of this test are investigated; the asymptotic
properties of the test statistic show that the test is consistent.
Furthermore, the rate of convergence of the test statistic to zero gives some
theoretical insight into a bootstrap technique previously suggested by the
author, and also into observed properties of kernel density estimates.
SIGNIFICANCE AND EXPLANATION


An  important  question  in  cluster  analysis  is  the  determination  of  the 
number  of  clusters  into  which  a  given  population  should  be  divided.  This 
problem  arises  in  almost  every  area  where  data  are  collected,  for  example  in 
physics,  geology,  medicine  and  psychology. 

Frequently,  particularly  when  certain  specific  clustering  methods  are 
being  used,  the  number  of  clusters  is  taken  to  be  equal  to  the  number  of 
modes,  or  local  maxima,  in  the  probability  density  function  underlying  the 
given  data  set.  The  author  has  previously  suggested  a  technique  for 
investigating  the  number  of  modes  underlying  a  given  population.  In  this 
paper,  the  mathematical  properties  of  this  procedure  are  investigated.  The 
results  obtained  confirm  various  intuitive  remarks  made  in  the  original 
presentation  of  the  method,  and  also  suggest  that  the  technique  may  cast  light 
on  another  important  problem,  that  of  determining  how  much  to  smooth  a  sample 
in  order  to  estimate  its  underlying  probability  density. 


The responsibility for the wording and views expressed in this descriptive
summary lies with MRC, and not with the author of this report.


ON A TEST FOR MULTIMODALITY BASED ON KERNEL DENSITY ESTIMATES

B.  W.  Silverman 

1. Introduction

Silverman  (1981)  suggested  and  illustrated  a  way  that  kernel  probability 
density  estimates  can  be  used  to  investigate  the  number  of  modes  in  the 
density  underlying  a  given  independent  identically  distributed  real  sample. 
Given an independent sample $X_1, \dots, X_n$ from a univariate probability
density f, define the kernel density estimate $f_n$ with Gaussian kernel by

$$ f_n(t,h) = \sum_{i=1}^{n} n^{-1} h^{-1} \phi\{(t - X_i)/h\}, $$

where the parameter h is the smoothing parameter or window width and $\phi$ is
the standard normal density function. Kernel density estimates were
introduced by Rosenblatt (1956) and Parzen (1962); the restriction to Gaussian
kernels in this work is made for reasons given in Silverman (1981). Often the
explicit dependence of $f_n$ on h will be suppressed.

Consider the problem of testing the null hypothesis that f has k or
fewer modes against the alternative that f has more than k modes. The
statistic suggested for constructing such a test was the k-critical window
width $h_{\mathrm{crit}}(k)$, defined by

$$ h_{\mathrm{crit}}(k) = \inf\{h : f_n(\cdot,h) \text{ has at most } k \text{ modes}\}. $$
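
A minimal numerical sketch of the two definitions above, in Python with NumPy. The function names (`kde_gaussian`, `count_modes`, `critical_width`), the evaluation grid and the bisection tolerance are illustrative assumptions and do not come from the report; modes are counted on a finite grid, so the result is only an approximation to the critical window width. The bisection relies on the property, established for Gaussian kernels in Silverman (1981), that the number of modes of the estimate does not increase as h increases.

```python
import numpy as np


def kde_gaussian(t, data, h):
    """Gaussian kernel density estimate f_n(t, h) at the points t."""
    t = np.asarray(t, dtype=float)[:, None]
    u = (t - np.asarray(data, dtype=float)[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))


def count_modes(data, h, grid_size=2048):
    """Count local maxima of f_n(., h) on a fine grid (an approximation)."""
    data = np.asarray(data, dtype=float)
    grid = np.linspace(data.min() - 3 * h, data.max() + 3 * h, grid_size)
    steps = np.sign(np.diff(kde_gaussian(grid, data, h)))
    steps = steps[steps != 0]  # ignore exactly flat steps
    # a maximum is an "up" step followed by a "down" step
    return int(np.sum((steps[:-1] > 0) & (steps[1:] < 0)))


def critical_width(data, k, tol=1e-4):
    """Approximate h_crit(k) = inf{h : f_n(., h) has at most k modes}."""
    data = np.asarray(data, dtype=float)
    lo, hi = 0.0, data.max() - data.min()
    if hi == 0:
        raise ValueError("data must contain at least two distinct values")
    while count_modes(data, hi) > k:  # make sure hi is on the "few modes" side
        hi *= 2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_modes(data, mid) > k:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical example: a clearly bimodal sample.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
h1 = critical_width(data, k=1)  # width needed to force unimodality
h2 = critical_width(data, k=2)  # width needed to leave at most two modes
```

For a sample like this one, with two well separated clusters, the width needed to force a single mode is much larger than the width needed to leave at most two, which is the behaviour the test exploits.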
In Silverman (1981) it was stated heuristically that large values of $h_{\mathrm{crit}}$
will tend to reject the null hypothesis. The results of this paper show that
this procedure does indeed lead to a consistent test.

Subject  to  certain  regularity  conditions,  it  is  shown  that,  under  the 
null  hypothesis,  hcrit  converges  stochastically  to  zero,  while  this  is  not 
the case under the alternative hypothesis. The exact rate of convergence of
$h_{\mathrm{crit}}$ to zero under the null hypothesis is found. It is perhaps interesting
that  this  rate  of  convergence  has  precisely  the  same  order  as  the  rate  of 
convergence  for  the  optimum  choice  of  window  width  for  the  uniform  estimation 
of  the  density  given,  for  example,  by  Silverman  (1978b). 

In  Silverman  (1981)  a  smoothed  bootstrap  procedure  for  assessing  the 
significance  of  an  observed  value  of  hcrit  was  suggested  and  illustrated  by 
an  application.  The  representative  of  the  null  hypothesis  constructed  from 
the data is obtained from the density estimate with window width $h_{\mathrm{crit}}$; the
estimate is rescaled, as suggested by Efron (1979), to have variance equal to
the sample variance of the data. The remarks above show that $f_n(\cdot, h_{\mathrm{crit}})$
is,  in  a  certain  sense,  optimally  uniformly  consistent  as  an  estimate  of  the 
true  density  f.  It  follows  that,  on  the  null  hypothesis,  the  bootstrap 
procedure  is  likely,  at  least  for  large  samples,  to  provide  an  estimate  of  the 
true  underlying  density  which  is  accurate  in  the  uniform  norm.  A  possible 
drawback  for  small  samples  is  the  fact  that  the  implied  constant  in  the  rate 
of  convergence  does  not  necessarily  take  its  optimum  value. 
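
The resampling step of such a smoothed bootstrap can be sketched as follows. This is one common form of the variance rescaling, written under stated assumptions: the function name `smoothed_bootstrap_sample` and the centring at the sample mean are illustrative choices and are not taken from Silverman (1981) or Efron (1979). Each draw is a data point plus Gaussian noise of standard deviation h, shrunk towards the sample mean so that the resampling distribution has the sample variance of the data.

```python
import numpy as np


def smoothed_bootstrap_sample(data, h, size, rng):
    """Draw from the Gaussian kernel estimate with window width h,
    rescaled to have the same variance as the sample (illustrative form)."""
    data = np.asarray(data, dtype=float)
    mean, var = data.mean(), data.var()
    picks = rng.choice(data, size=size, replace=True)  # ordinary bootstrap draw
    noise = h * rng.standard_normal(size)              # kernel smoothing
    # picks + noise has variance var + h**2; shrink back to variance var
    return mean + (picks - mean + noise) / np.sqrt(1.0 + h**2 / var)


# Hypothetical check that the rescaling does what is claimed.
rng = np.random.default_rng(1)
data = rng.normal(size=200)
resample = smoothed_bootstrap_sample(data, h=0.5, size=200_000, rng=rng)
```

In a full test one would presumably draw resamples of the original size n using the observed critical width, recompute the critical window width for each, and compare the results with the observed value; that outer loop is omitted here.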

An interesting open question raised by this discussion is the possibility
of using $h_{\mathrm{crit}}(k)$ for some value of k in developing an automatic method
for choosing the smoothing parameter in density estimation. Boneva, Kendall
and Stefanov (1971) suggested choosing the window width where 'rabbits' or
rapid fluctuations just started to appear. Such a window width would perhaps
correspond to $h_{\mathrm{crit}}(k)$ for some k > j; since $h_{\mathrm{crit}}(k)$ converges to zero
at the optimum rate for all k > j, a suitable formalisation of the
Boneva-Kendall-Stefanov procedure would give estimates which converged at the
optimal rate, though not necessarily with the optimal constant multiplier.
The fact that $h_{\mathrm{crit}}(k)$ has the same rate of convergence for all k > j
provides some explanation for the observation made by Boneva, Kendall and
Stefanov that the estimate seems suddenly to become noisy as the window width
is reduced.

The  use  of  kernel  density  estimates  in  mode  estimation  was  originated  by 
Parzen  (1962).  The  'gradient  method'  of  cluster  analysis  is  based  on 
clustering towards modes in the estimated density; see, for example, Andrews
(1972), Fukunaga and Hostetler (1975), and Bock (1977). Papers related to
tests of multimodality are Cox (1966) and Good and Gaskins (1980).




2.  The  main  result 


It  is  convenient  to  prove  the  various  assertions  of  the  theorem 
separately.  Except  where  otherwise  stated,  the  conditions  of  the  theorem  on 
f  will  be  assumed  to  be  true  throughout.  The  first  proposition  facilitates 
the proof of (2).

Proposition 1. Given any $c_1$ with

$$ 0 < c_1 < \tfrac{2}{3} \pi^{1/2} c_0, $$

suppose the sequence of window widths satisfies

$$ n^{-1} \alpha(h_n) \to c_1. \qquad (5) $$

Then the number of maxima of $f_n$ tends in probability to j.

It follows from Proposition 1 and Silverman (1981) that, for all k > j,
provided (5) holds,

$$ P\{h_{\mathrm{crit}}(k) \le h_n\} \to 1, $$

and hence that (2) is satisfied.
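
As a worked illustration of what condition (5) means for the size of the window width, assume that $\alpha(h) = h^{-5}\log(1/h)$. The definition (1) of $\alpha$ is not reproduced in this part of the text, so this form is an assumption, inferred from the remark in the Discussion that $\alpha(h)$ is replaced by $h^{-7}\log(h^{-1})$ in the degenerate case. Under that assumption, a window width of order $(\log n / n)^{1/5}$ satisfies (5):

```latex
% Assumption (not restated in this section): \alpha(h) = h^{-5} \log(1/h).
% Try h_n = (A \log n / n)^{1/5} for a constant A > 0.
\begin{align*}
h_n^{-5} &= \frac{n}{A \log n}, \\
\log(1/h_n) &= \tfrac{1}{5}\,\{\log n - \log\log n - \log A\}
             = \tfrac{1}{5}\,\log n \,\{1 + o(1)\}, \\
n^{-1}\alpha(h_n) &= \frac{1}{A \log n} \cdot \tfrac{1}{5}\,\log n \,\{1 + o(1)\}
             \;\longrightarrow\; \frac{1}{5A}.
\end{align*}
% Hence (5) holds with c_1 = 1/(5A), that is, with A = 1/(5 c_1).
```

So, under this assumption, Proposition 1 bounds the critical window width above by a sequence of order $(n^{-1}\log n)^{1/5}$ with probability tending to one.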

The proof of Proposition 1 makes use of several lemmas, the first of
which shows that, under certain conditions, maxima and minima of $f_n$ can,
eventually, only occur arbitrarily close to those of f.

Lemma 1. Let I be any closed interval contained in [a,b], such that I
contains none of the zeroes of $f'$. Then, provided $h_n \to 0$ and
$n^{-1} h_n^2 \alpha(h_n) \to 0$, it will follow that

$$ P(f_n \text{ monotonic on } I \text{ in the same sense as } f) \to 1. $$

Proof. By slight adaptation of the results of Silverman (1978a), it can be
seen that, provided f is bounded, we will have, if $h_n$ satisfies the
assumptions of Proposition 1,

$$ \sup_{I} |f_n' - E f_n'| = O_p\{n^{-1/2} h_n \alpha(h_n)^{1/2}\} = o_p(1). \qquad (6) $$
In Silverman (1978a) the uniform continuity of f was additionally assumed,
but careful examination of the proofs of that paper shows that the derivation
of the rate of stochastic convergence, though not of the exact constant
implied in the $O_p$, goes through under the assumption of bounded f.

Supposing without loss of generality that f is increasing on I, it
follows from the continuity of $f'$ on [a,b] that $f'$ is bounded away from
zero on I and is non-negative on a neighborhood of I, and hence by
elementary analysis that

$$ \liminf_{n \to \infty} \inf_{I} E f_n' > 0. \qquad (7) $$


Combining  (6)  and  (7)  completes  the  proof  of  Lemma  1. 

The next lemma shows that, under suitable conditions, $f_n$ will
eventually have exactly one maximum and no minima near each maximum of f,
and exactly one minimum and no maxima near each minimum of f.

Lemma 2. Suppose $f'(z) = 0$ and f has a local maximum (respectively
minimum) at z. Suppose $h_n \to 0$ and

$$ n^{-1} \alpha(h_n) \to c_1 \in \left(0,\ \tfrac{2}{3} \pi^{1/2} f''(z)^2 / f(z)\right). \qquad (8) $$

Then, for all sufficiently small $\varepsilon > 0$, the probability that $f_n'$ has
exactly one zero in $(z-\varepsilon, z+\varepsilon)$, and that this zero is a maximum
(respectively minimum) of $f_n$, tends to one as n tends to infinity.

Proof. Only the case of a local maximum will be considered. The proof for a
minimum proceeds very similarly and is omitted. Throughout this proof
unqualified infima and suprema will be taken to be over x in $[z-\varepsilon, z+\varepsilon]$.

By the continuity of f and $f''$, choose $\varepsilon$ sufficiently small that

$$ \frac{\inf f''(x)^2}{\sup f(x)} > \frac{3 c_1}{2 \pi^{1/2}} \qquad (9) $$
and also $[z-\varepsilon, z+\varepsilon] \subset (a,b)$. It is then immediate that
$f'(z-\varepsilon) > 0$ and $f'(z+\varepsilon) < 0$ since, by (9), $f''$ cannot cross zero in
$(z-\varepsilon, z+\varepsilon)$. Since $f'$ is continuous at $z \pm \varepsilon$, by standard results on
the consistency of $f_n'$ (a combination of Parzen (1962) and Bhattacharya (1967))

$$ P\{f_n'(z-\varepsilon) > 0 \text{ and } f_n'(z+\varepsilon) < 0\} \to 1. \qquad (10) $$

Very slightly adapting the proofs of Silverman (1976 and 1978a) to cope
with the fact that $f''$ is only uniformly continuous on a neighborhood of
$[z-\varepsilon, z+\varepsilon]$ gives

$$ n^{1/2} \alpha(h_n)^{-1/2} \sup |f_n''(x) - E f_n''(x)| \to K_1 \quad \text{in probability,} $$

where

$$ K_1^2 = 2 \sup f \int \phi''^2 = 3 (2 \pi^{1/2})^{-1} \sup f. $$

Since, by elementary analysis, $\sup |E f_n''(x) - f''(x)|$ converges to zero, it
follows from (8) that

$$ \operatorname{p\,lim\,sup}_{n} \sup |f_n''(x) - f''(x)| \le K_1 c_1^{1/2} < \inf |f''(x)| $$

by (9). It is immediate that

$$ P\{f_n''(x) < 0 \text{ for all } x \text{ in } [z-\varepsilon, z+\varepsilon]\} \to 1. \qquad (11) $$

Combining  (10)  and  (11)  completes  the  proof  of  Lemma  2. 

To complete the proof of Proposition 1, note first that no maxima of $f_n$
can occur outside the interval (a,b). Let $z_1, \dots, z_{2j-1}$ be the zeroes of
$f'$ in (a,b) and choose $\varepsilon$ sufficiently small to satisfy the conclusion of
Lemma 2 for all $z_i$ and to ensure that

$$ a < z_1 - \varepsilon < z_1 + \varepsilon < z_2 - \varepsilon < \dots < z_{2j-1} + \varepsilon < b. \qquad (12) $$

Applying either Lemma 1 or Lemma 2 as appropriate to each of the intervals in
the partition (12) of the interval (a,b) completes the proof of Proposition
1.


The  next  proposition  leads  to  the  proof  of  assertion  (3),  in  a  similar 
way  to  the  derivation  of  (2)  from  Proposition  1. 

Proposition 2

Defining $\alpha$ as in (1) above, suppose that

n  ^otOi  )  08  and  n  'h  5  +  0  .  (13) 

n  n 

Then the number of maxima in $f_n$ tends in probability to infinity.

Given any k, it follows from this result and the corollary of Silverman
(1981) that, provided (13) holds,

$$ P\{h_{\mathrm{crit}}(k) > h_n\} \to 1; $$

assertion (3) follows at once.

To prove Proposition 2, suppose without loss of generality that f has a
maximum at 0 in (a,b). Choose a sequence $\ell_n$ which satisfies


l  ♦0,  h1*  =o{n1o(h)}  , 

n  n  n  —  n 

(14) 

h  'i  •  and  I  log  l  |  |log  h  |  1  ♦  1  . 

n  n  n  n 

The explicit dependence of h and $\ell$ on n will often be suppressed. Let
$I_{j,n}$ be the interval $[(j-1)\ell, j\ell]$ for integer j > 0.

Following Silverman (1978a), apply Theorem 3 of Komlos, Major and Tusnady
(1975) to obtain

$$ f_n'(x) = E f_n'(x) + n^{-1/2} h^{-1} \rho_1(x) + \varepsilon_n'(x), $$

where $\rho_1$ is a Gaussian process with the same covariance structure as
$n^{1/2} h (f_n' - E f_n')$ and $\varepsilon_n'$ is a secondary random error. The process
$\rho_1$ is obtained by putting $\delta(u)$ equal to $\phi'(u)$ in Proposition 1 of
Silverman (1978a). By elementary analysis and the arguments of Silverman
(1978a) we have, in a neighborhood of 0,

$$ |E f_n'(x) - f'(x)| = O(h); $$

|e'(x'|  =  0(n  'h  2log  n)  a.s. 

n  — 

2 

=  o(h  )  from  (13)  above  ; 
and $|f'(x)| = O(x)$,

since $f'(0) = 0$ and $f''$ exists. It follows that, a.s.,

sup|Ef’(x)  +  e'(x))  =  0(ji)  +  0(h) 
n  n  —  — 

1  (15) 

=  o{n  ^ h  5log(£/h)}2 

by (13) and (14) above, where we adopt the convention, here and subsequently
in this proof, that unqualified suprema are taken to be over the interval
$I_{j,n}$ and that a fixed j is being considered.

We slightly adapt the argument of Silverman (1976) pp. 138-140 to
investigate $\sup \rho_1$. Define

$$ \sigma^2(x) = \operatorname{var} \rho_1(x) = h^{-1} f(x) \int \phi'^2 \, \{1 + o(1)\}
   = h^{-1} f(0) \int \phi'^2 \, \{1 + o(1)\} \quad \text{for } x \text{ in } I_{j,n}, $$

since the end points of $I_{j,n}$ both converge to zero. Analogously to (12) of
Silverman (1976), given any $\lambda$ in (0,2),


(16) 


P (sup  a  1p1  <  ( 1  -  j  A )  {2  log  (h-1A)}2] 

<  0(A~2)log<h_1A) 

x  //  |x|exp{2  log(h”1A)(1  -|x)2|xl/(1  +  IxD) 
1  j*n 


where  x(x/y)  “  corr{p ( x) ,p (y) } .  Using  a  similar  argument  to  that  following 
(12)  of  Silverman  (1976)/  but  allowing  the  interval  I  to  vary,  shows  that 
the  expression  in  (16)  is  dominated  by 


(1  -  -1  X)2 

0(l~2)  logOT1!)  {o2(0)  +  o(l)}"1  {h"1!}  2  0(A) 

-i  -x  + 

-  (h  A)  *  log(h  A)  0 


by  (14)  above. 


It follows that, setting $K = \{2 f(0) \int \phi'^2\}^{1/2}$,

$$ \operatorname{p\,lim\,inf} \sup \{h^{-1} \log(h^{-1}\ell)\}^{-1/2} \rho_1 \ge K \qquad (17) $$

and that the same result holds if $\rho_1$ is replaced by $-\rho_1$, giving a
corresponding result for $\inf \rho_1$. It follows from (15), (17) and the
corresponding result for $\inf \rho_1$ that

$$ P\{\rho_1 \text{ crosses } -n^{1/2} h (E f_n' + \varepsilon_n') \text{ in } I_{j,n}\} \to 1, $$

and hence that

$$ P\{f_n' \text{ crosses zero in } I_{j,n}\} \to 1. \qquad (18) $$

Since (18) holds for all j, the number of maxima in $f_n$ tends in
probability to infinity, completing the proof of Proposition 2.
The final proposition of this section deals with the case where the
alternative hypothesis is true, and shows that $h_{\mathrm{crit}}$ will remain bounded
away from zero.

Proposition 3

If k < j then there exists a constant $h_0 > 0$, depending on f
and k, such that

$$ P\{h_{\mathrm{crit}}(k) > h_0\} \to 1. $$

Proof

By arguments analogous to those of the proof of the theorem of Silverman
(1981), making use of the variation diminishing properties of the Gaussian
kernel and the continuity properties of $E f_n$, the number of maxima in
$E f_n(\cdot,h)$ is a right continuous decreasing function of h, for h > 0. By
choosing $h_0$ sufficiently small, we can ensure that $E f_n(\cdot,h_0)$ has,
independently of n, exactly j maxima. Because of the conditions imposed
on f in the statement of the Theorem above, we can also ensure that
$E f_n''(\cdot,h_0)$ is non-zero at all stationary points of $E f_n(\cdot,h_0)$.

The argument of Lemma 2.2 of Schuster (1969), which does not in fact
require the convergence to zero of the sequence of window widths, then implies
that, with probability one,

$$ f_n'(x,h_0) - E f_n'(x,h_0) \quad \text{and} \quad f_n''(x,h_0) - E f_n''(x,h_0) $$

both converge to zero uniformly over x. By an argument similar to that used
in Proposition 1 above, it follows that the number of maxima of $f_n(\cdot,h_0)$ on
[a,b] tends almost surely to j, the number of maxima of $E f_n(\cdot,h_0)$.
Applying the corollary of Silverman (1981) completes the proof of Proposition
3.

3. Discussion


It is natural to enquire to what extent the conditions of the theorem
above can be relaxed without affecting the conclusions. In particular it
seems intuitively clear that the condition of bounded support for the
density f should be able to be replaced by some condition on the tails of
f, though the present method of proof cannot deal with this case. Condition
(iv) appears to be more fundamental to the result; if, for example,
$f'(0) = f''(0) = 0 \ne f'''(0)$, then an examination of $f_n$ and $E f_n$ near zero
seems to indicate that, under suitable regularity conditions, there will be no
maximum of $f_n$ near zero provided $|f_n''' - E f_n'''|$ remains small. A heuristic
argument suggests that a result corresponding to the theorem of Section 2 can
be proved, but with $\alpha(h)$ replaced by $h^{-7} \log(h^{-1})$, so that $h_{\mathrm{crit}}$
converges to zero more slowly. Even slower convergence will occur for higher
order zeroes in $f'$.

The interest in this discussion lies in the fact that the bootstrap
density constructed using the critical window width will not only have
infinite tails of similar weight to those of the corresponding normal kernels
but will also have a stationary point which is a point of inflexion. The
slower convergence to zero of $h_{\mathrm{crit}}$ provides support for the remark of
Silverman (1981) that the bootstrap test may be conservative; it also bears
out the intuition of P. Huber (private communication) that the bootstrap
procedure may be excessively conservative, though the difference between
$n^{-1/5}$ and $n^{-1/7}$ convergence is very slight in practice.
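
To give a rough sense of scale for that last remark, the short Python sketch below compares the two power rates at a few sample sizes. It ignores the logarithmic factors and all constants, and the helper name `rate_ratio` is purely illustrative.

```python
def rate_ratio(n):
    """Ratio n^(-1/7) / n^(-1/5) = n^(2/35); logarithmic factors are ignored."""
    return n ** (-1.0 / 7.0) / n ** (-1.0 / 5.0)


# Print n, n^(-1/5), n^(-1/7) and their ratio for a few sample sizes.
for n in (10**2, 10**4, 10**6):
    print(n, round(n ** (-1 / 5), 3), round(n ** (-1 / 7), 3), round(rate_ratio(n), 2))
```

The ratio grows only like $n^{2/35}$, so it changes slowly even over several orders of magnitude in n.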

The methods of this paper can also be used to study the asymptotic
properties of a corresponding test for the number of points of inflexion in
the density. Both Cox (1966) and Good and Gaskins (1980) prefer to use points
of inflexion as an indication that the density is a mixture. The critical
window width will now be the smallest window width for which the density has
k maxima. Under suitable conditions a result corresponding to the theorem of
Section 2 can be proved, but again, among other changes, $\alpha(h)$ will be
replaced by $h^{-7} \log(1/h)$ since $f_n''$ will be replaced by $f_n'''$ in much of
the argument of the proofs of Propositions 1 and 2.